Structural predictions of protein–DNA binding: MELD-DNA

Abstract Structural, regulatory and enzymatic proteins interact with DNA to maintain a healthy and functional genome. Yet, our structural understanding of how proteins interact with DNA is limited. We present MELD-DNA, a novel computational approach to predict the structures of protein–DNA complexes. The method combines molecular dynamics simulations with general knowledge or experimental information through Bayesian inference. The physical model is sensitive to sequence-dependent properties and conformational changes required for binding, while information accelerates sampling of bound conformations. MELD-DNA can: (i) sample multiple binding modes; (ii) identify the preferred binding mode from the ensembles; and (iii) provide qualitative binding preferences between DNA sequences. We first assess performance on a dataset of 15 protein–DNA complexes and compare it with state-of-the-art methodologies. Furthermore, for three selected complexes, we show sequence dependence effects of binding in MELD predictions. We expect that the results presented herein, together with the freely available software, will impact structural biology (by complementing DNA structural databases) and molecular recognition (by bringing new insights into aspects governing protein–DNA interactions).


INTRODUCTION
Understanding and predicting how proteins and nucleic acids interact is key to deciphering the mechanisms regulating gene expression, genome repair and storage, with applications in fields such as nanomedicine (1), transition metal chemistry (2) or clinical diagnosis (3), to name a few. The prediction of such molecular recognition problems typically falls under molecular docking, machine learning and molecular dynamics-based studies, which have been broadly successful in understanding how proteins recognize small molecules, peptides and other proteins (4)(5)(6). However, the nature of protein-DNA interactions introduces several nuances that have challenged standard approaches. First, whereas proteins come in diverse shapes and sizes, the double-stranded B-DNA structure is common to most DNA sequences, leading for many years to the concept of proteins 'reading' DNA. Second, the highly charged interactions from the repeating phosphate backbone lead to a particular protein interacting with high affinity with many different DNA sequences. Despite the high affinity, specificity (7) for different sequences can span several orders of magnitude, leading to a preferential binding for certain sequences. Transcription factor (TF) proteins regulate gene expression by binding DNA sequences between 6 and 12 bp in length, statistically found thousands to millions of times along the genome. Still, most of these sites are never experimentally occupied (8,9). TF binding has received particular attention due to its biological roles; thus, we will focus on those. Understanding TF protein-DNA interactions requires answering three distinct questions: (i) what structure will a particular protein-DNA complex adopt; (ii) what sequence will a particular TF preferentially bind (e.g. where along the genome will the complex form); and (iii) what are the relative binding affinities for different sequences? Pipelines combining elements from structural databases, molecular dynamics, docking and machine learning are becoming more prominent to address some (or all) of the above questions (10,11).
While the initial views on protein-DNA recognition focused on the ability of proteins to 'read' the DNA sequence and find the best binding site, current views show that DNA has a much more prominent role in recognition (9,12). Thus, some proteins interact with DNA through sequence-specific interactions, but many interact through shared DNA features (e.g. the phosphate backbone). The difference in binding patterns leads to binding mechanisms lying between two extremes: sequence readout and shape readout (9,(13)(14)(15)(16). The first is governed by specific interactions, while the second accounts for the ability of a particular DNA sequence to adopt conformations compatible with the structure of the complex. Attempts at combining structural descriptors (e.g. average groove widths, propeller twist or roll) combined with sequence preferences derived from genomic studies (7,(17)(18) significantly improve predictions of where TF proteins bind along the genome. However, attempts to use detailed structural approaches, such as free energy perturbation or energy decomposition, to distinguish shape and sequence contributions to binding are not always successful, limiting their general application (19)(20)(21). Although the reason for failures are unclear, factors such as the amount of DNA deformation in different TFs, the number of binding modes contributing to the binding free energy and force field accuracy are suspected to be involved (21).
We require initial structures of the complex for many studies involving an understanding of binding affinity predictions, binding mechanisms or even DNA-mediated allostery. However, there is a lack of experimental structural data and few computational methods to predict such complexes accurately. For instance, there are 6052 protein-DNA structures in the Nucleic Acid Database, compared with 171 077 protein structures in the Protein Data Bank (PDB) as of October 2022 (22,23). Such small datasets pose challenges to adapting protein structure prediction approaches [e.g. AlphaFold (24)] to the problem of predicting nucleic acids and their complexes. The recent RoseTTAFoldNA (25) machine learning approach compensates for the smaller datasets (they report 1556 protein-nucleic acid complex clusters compared with 26 128 all protein clusters) by incorporating physics-based parameters (e.g. van der Waals terms) originating from Rosetta. Despite the advances this method represents, especially in RNA structure prediction, it requires further development to correct anomalies in predicting structures of homodimers bound to DNA (prevalent in TFs), which leads to an overlap between monomer units and incorrect binding modes.
For many decades, docking has been the most efficient technique for predicting atomic resolution structures of macromolecular binding. (26) Docking is an invaluable tool for the virtual screening of small molecule libraries in the early stages of drug discovery (27) thanks to the balance between computational efficiency and accuracy. For larger assemblies, community efforts such as the Critical Assessment of PRediction of Interactions (CAPRI) (28) have led to significant improvements in the field. However, the accuracy of docking drops rapidly for systems involving conformational changes and in charged systems, where scoring functions are less reliable. Thus, docking methods will often combine with strategies such as normal modes or a later stage using molecular dynamics to overcome limitations in scoring functions (29,30). The most successful efforts come from the High-Ambiguity Driven Docking (HAD-DOCK) (31,30) group combining docking with ambiguous data and DNA flexibility (via normal modes). They have led to helpful benchmark sets of different difficulties for assessing protein-DNA docking performance (32).
Knowing the structure of the complex opens up the possibility of predicting relative binding affinities for different sequences. Alchemical free energy methods based on molecular dynamics simulations are the gold standard for predicting relative (and absolute) binding affinities for small molecules (33). However, they present limitations when dealing with electrostatic charge variations and flexible complexes where several binding modes can contribute to the binding free energy (34). Despite some successes for protein-DNA systems (35,20), a recent systematic study (21) on protein-DNA complexes points to deficiencies of these approaches arising from either phase space overlap or force field issues. Thus, current methods based on physics and statistical potentials such as Rosetta (36) or machine learning (37)(38)(39)(40) far outperform traditional molecular dynamics-based approaches in speed and accuracy.
In this study, we introduce a framework based on alchemical free energy methods that incorporates information via Bayesian inference to predict structure-sequence relationships to increase our understanding of how proteins bind nucleic acids. The approach, MELD (Modeling Employing Limited Data) (41,42), has been previously used for predicting protein structures and their complexes with small molecules, peptides and proteins. Here we extend the framework to DNA (MELD-DNA), with different applications depending on the type of data used. This work exemplifies three of the most common uses: (i) predicting structures of protein-DNA complexes; (ii) identifying sequence sensitivity; and (iii) predicting relative binding affinities. We expect the first application to be the most broadly used and thus present a generalized protocol over 15 different proteins. We make more specific comparisons over a series of 15 sequences for three of those systems. Finally, we exemplify the application of relative binding affinities for six sequences binding a particular TF. The current work represents an extensive simulation study with an aggregated 1 ms of sampling. Our results show that MELD-DNA successfully: (i) samples multiple binding modes; (ii) identifies the experimental binding mode through clustering the ensembles; and (iii) is sensitive to DNA sequences and conformations.

General approach
We use the MELD Bayesian inference approach [p(x|D) ␣ p(D|x) · p(x)] to incorporate ambiguous and noisy data to enhance binding/unbinding events (41,42). The prior distribution [p(x)] is given by the Boltzmann distribution based on the chosen force field, while the likelihood [p(D|x)] comes from the agreement of the sampled conformations (x) with a subset of the data (D, the one with the lowest restraint energy). As MELD samples the energy landscape, different subsets of data are also explored, exploiting regions compatible with some subset of data, and the force field (41,42) gives rise to the posterior distribution [p(x|D)]. In practical terms, MELD uses a Hamiltonian and Temperature replica exchange molecular dynamics approach in which some replica conditions are compatible with unbound states and some with bound states. As 'walkers' sample different conditions in the replica ladder, they go through cycles of binding and unbinding. We identify bound states by clustering the lowest temperature ensembles, where each cluster represents a different binding mode and is compatible with varying subsets of data. We will showcase here three protocols to address three questions: (i) general binding (applied to any protein-DNA system); (ii) specific binding (applied to many DNA sequences binding a particular protein where additional information is known); and (iii) relative binding affinities. The type of data used to guide simulations depends on the questions we ask. Examples are accessible from Zenodo (see Data Availability section). MELD simulations use 30 replicas, the parmBSC1 force field for nucleic acids (43)(44)(45), the ff14SB side force field for the protein (46,47) and the GBneck2Nu implicit solvent model (48,49). Throughout all protocols, we include restraints to keep the protein and DNA from unfolding at high temperatures. For proteins, we enforce secondary structure and flat-bottom harmonic restraints on native C␣-C␣ contacts; the initial coordinates for simulations and to set up the restraints are based on its bound conformation. For DNA, we implement restraints that maintain hydrogenbonding patterns at each base pair to prevent DNA melting. All simulations were initialized with the protein far away (at least 30Å) from the DNA. The initial DNA conformation is generated in its canonical B-form based on the sequence (50).

Protocol 1: posing general knowledge in terms of ambiguous data drives protein-DNA structure prediction
In this approach, we seek to explore multiple binding modes and rely on statistical mechanics to identify the most nativelike one (e.g. the most compatible with our physical model). We presume knowledge of (i) the protein structure, (ii) the DNA sequence to bind and (iii) the DNA-binding domain. We generate a B-DNA structure using Chimera (50) and create a dummy atom at the N1 position of each purine base (see Figure 1). We define the binding data as the possible interactions between C␣ atoms in the binding domain and the N1 atoms ( Figure 1A). We produce a list of potential contacts, where only some might be satisfied during binding (noisy data) ( Figure 1B). We reduce the amount of possible combinatorics by taking into consideration geometric considerations (e.g. residues far away in the binding site are unlikely to interact with the same DNA base simultaneously). Clustering on the MELD ensemble, we identify native-like poses in the ensemble (see Figure 1). The current set-up has two advantages: (i) by using dummy particles at the N1 site, we do not favor the protein approaching through either major or minor groove orientations and (ii) the information added is not exhaustive of all possibilities--and it does not need to be, as the force field will sample the most likely conformations given the available data ( Figure 1C). We chose 15 protein-DNA systems to apply this approach (see Table 1). The systems include complexes with little or no deformation of the DNA from its canonical B-DNA form, others that induce moderate deformation upon binding and complexes where the DNA is far from its canonical B-DNA conformation. The dataset also contains systems that have been solved experimentally with two different sequences, resulting in binding mode variations (e.g. different spacing between binding domains: 1R4R and 1R4O). Finally, we include two types of systems intended to challenge our approach: binding occurs through either flexible (disordered) tails (1ZME) or where large conformational changes are needed for accessing the binding site (1BGB and 2B0D).

Protocol 2: system-specific binding protocols tease out sequence effects
In this scenario, we have knowledge of the bound state (and binding mode) and use this information to guide to repeating binding sites along a DNA oligomer (see Figure 2). By using oligomers with different sequences (but the same driving information), we are interested in teasing out the drivers of binding (sequence or shape readout). When the data are too constraining, we should see the same binding mode regardless of the sequence. We compare multiple sequences for three systems involving different degrees of DNA bending upon binding (see Table 2).

Protocol 3: competitive binding simulations quantify relative binding affinities
The data used in this protocol are the most constraining: we aim to distinguish the preference for a particular known binding mode to two different sequences. Thus, the data are compatible with one binding mode. Rather than defining this strictly as in docking or an alchemical free energy calculation, there is still enough freedom to sample widely inside the basin corresponding to this binding mode. To compare across systems, we have previously used a competitive binding strategy (65) in which the relative binding affinity can be determined by counting the population of the protein-DNA complex for each sequence. Converging populations to obtain statistical significance makes this protocol more computationally demanding than the previous two. We thus exemplify this protocol on the six sequences binding P22 shown in Table 2.

HADDOCK docking predictions
We used the bio3d R module to calculate normal modes of the minimized canonical B-DNA and minimized protein for each system (66,67). The ensemble of normal modes for each binding partner was fed into the HADDOCK web server (68,69), specifying identical residues involved in the contact lists of MELD as active residues. We then analyzed the top five clusters and all the models generated by the docking package.

RosettaFold2NA machine learning predictions
The RosettaFold2NA (RF2NA) program was assembled following their GitHub walkthrough (25). The FASTA sequence of each protein chain and DNA strand was provided to the program as separate files according to the instructions. Each prediction returned one structure.

MELD-DNA is successful at predicting protein-DNA complexes
We analyze our ensembles for each of the 15 proteins by asking two questions (see Figure 3): (i) can the method sample the native state and (ii) can we identify the native state with high confidence without knowing the actual structure? We find that in 13 of the 15 cases the native state is present in the ensemble, and in 11 of 15 cases it belongs to a high population cluster (present in the top five clusters by population). The ensembles represent multiple binding modes, highlighting MELD's ability to explore various bound conformations. Supplementary Figures S1-15 represent the distribution of structures sampled at every replica in the MELD   Figure S16). Similarly, most leucine zippers correctly identify the binding mode, with the binding domains overlapping the experimental binding mode. For these long coiled-coil structures, small fluctuations near the binding site give rise to a large displacement at the ends--hence some of these structures look visually distinct from the experimental structure but retain a low interface RMSD (e.g. 2GDG in Figure 3). Despite the use of restraints to keep the protein and DNA from unfolding, we find that both macromolecules sample a broad ensemble of conformations. Typically, the DNA and protein ensembles approach the holo conformation as the native binding mode is sampled (see Supplementary Figure S17). We examined predictions from HADDOCK and RF2NA for comparison. For HADDOCK, we first tried submitting only one DNA and protein structure as input Figure 3. Superposition of the best in top five clusters of MELD simulations against the experimental structure. We report the interface RMSD of the top cluster, the best cluster amongst the top five clusters and the best structure in the ensemble. A prediction is marked as a success if we can find a <5Å conformation in the top five clusters.
for each complex. As expected, in this scenario, we did not observe any native-like predicted complexes (which MELD also fails to predict in the absence of data). We then used several structures for both the protein and DNA sampled along their normal modes as calculated by the bio3D package (66,67,70). HADDOCK returns 200 models of the complex by default and clusters them. We observed many native-like complexes amongst the 200 generated models (Table 3; Supplementary Figures S18 and S19). However, the top five clusters generally had higher interface RMSDs and a lower fraction of native contacts (71,72) than MELD predictions. In our hands, the RF2NA machine learning approach produced overlaps between the two monomers in protein dimers where both bind in the same DNA region (marked as MO in Supplementary Table  S1). For 1BY4, it used the two monomers of the protein to form a large monomer-like protein binding only on one of the two binding sites of the DNA. We are optimistic that future versions will be available to predict structures of homodimers binding DNA.

MELD is sensitive to sequence properties
We chose three systems representative of binding DNA in its canonical conformation, bent DNA and significantly bent DNA (bZIP, P22 and TATA box-binding proteins, respectively). For each of the systems, we simulated the consensus sequence, for which we have an experimental structure of the complex. We then generated the sequences defined in Table 2, for which we have no experimental data. The consensus sequence is the highest affinity binder of all sequences. However, other sequences are also likely to bind based on charge complementarity--especially in cases with no specific interactions between protein residues and DNA nucleobases. Hence, we expect differences in the ensemble, either by binding at different sites along the sequence or by affecting the populations identified by clustering with respect to the consensus sequence (see Figure 4).
For all three cases, MELD correctly identifies the native binding mode when the consensus sequence is used; however, we differentiate two different behaviors when chang-Nucleic Acids Research, 2023, Vol. 51, No. 4 1631  Any two residues with at least one heavy atom pair within 5Å in the experimental structure were defined as a contact. Instances where ≥70% of the native contacts are satisfied are highlight in bold ing the sequence. For P22 and TATA box-binding protein, binding is driven by shape complementarity, and, in all simulations, we observe a very similar binding pattern, with all DNA sequences adapting bent conformations. However, for bZIP, which binds DNA in its canonical B-DNA conformation, we observe dramatic shifts in cluster populations and binding preferences, with some sequences preferring to bind different sites along the sequence (see Figure 4A). To further study the sequence/structure relationship, we took the top two clusters from unbiased molecular dynamics simulations of each protein-free DNA sequence, and we performed binding simulations restraining the conformation of each sequence to each of the possible clusters (a total of 50 simulations). Supplementary Figure S20 shows the RMSD of the protein versus the protein-DNA interface RMSD. While some structure/sequence combinations lead to binding, others do not. For the consensus sequence, the experimental binding mode is sampled in high populations in 9/10 simulations, with the highest population cluster falling in this region most of the time. Other sequences have a lower preference for the canonical binding site despite starting from the same conformation and using the same data. Conformation random1-c1 (random1 sequence and cluster 1 conformation) merits a special mention: only using the sequence from random1 does this conformation allow sampling of the experimental binding mode. For this structure, we find a significant amount of binding through the minor groove in most sequences; however, for Random1 se-quence, the most prevalent binding mode is through the major groove, as in the experimental structure. Thus, we conclude that the physical model used in MELD is sensitive to sequence-dependent properties. Despite this, in cases where the DNA undergoes significant conformational changes, where we expect shape readout to drive binding, the MELD restraints overcome sequence preferences. It is difficult to directly compare the effect of the restraints as independent MELD binding simulations are not comparable (73). We thus complement these results with relative binding affinity calculations (protocol 3).

Relative binding affinities identify structure/sequence preferences during binding
Here two DNA duplexes and a protein are simulated together (see the Materials and Methods), promoting protein binding to either DNA duplex at low replicas while restricting interactions between the duplexes. During unbinding events, the protein is far from both DNA duplexes. We assess the higher affinity binders by counting how often the protein binds each duplex. In this approach, the binding mode is known; thus, MELD uses more constraining information to favor faster binding/unbinding events, leading to higher statistical significance. We first tested simulations in which we competed the consensus sequence against itself. The expected outcome is that the protein should bind equally to each sequence and is thus a test of the expected errors as well as possible systematic errors arising from the set-up conditions. The observed ratio of 57/43 populations was close to the expected 50/50 value (see Figure 5B), giving us confidence that the set-up conditions were not biased towards one of the two DNA structures. Competing the consensus against the other five sequences showed that the consensus sequence was preferred in all cases.
The advantage of simulation tools is that we can decouple sequence and shape readout by freezing the DNA structure in either its apo or holo conformation--similar to a rigid docking experiment, where the DNA can only fluctuate around the initial structure. We expected all relative free energies to increase when freezing DNA structures to the holo conformation as the conformational free energy change to the binding free energy was 'pre-paid.' Interestingly, while it increased in most cases (see Figure 5D), it decreased in others, most significantly for the random sequence. We further analyzed this sequence by fixing either the consensus or random sequence to the holo/apo conformation. Surprisingly, the protein recognizes the random sequence preferentially when both structures are kept in their B-DNA form but the consensus sequence when they are in their holo conformation. Thus, simulations that include flexibility to deform will have more events progressing down the replica ladder when the protein binds the random sequence. Still, once they form the complex structure, the consensus remains bound for a longer time. The approach is not currently sensitive to large free energy differences--where only one structure might be bound at the lowest replica. The method can readily be made more quantitative by using information from all replicas with proper reweighting [e.g. using the multistate Bennett acceptance ratio (74)]; however, this was beyond the scope of the current work (74).

DISCUSSION
The MELD approach is designed to help answer questions involving structural and energetic considerations, drawing from DNA's sequence dependence binding preferences. Choosing the origin of the information can help in either refining structures from these methods (e.g. making predictions from these methods and generating ambiguous/noisy data to guide MELD binding), answering questions about shape/sequence readout or even relative binding affinities. The general pipeline is available on GitHub (github.com/PDNALab/MELD-DNA) to showcase the potential of this framework. Overall, the approach successfully predicts the structures of protein-DNA complexes. Understanding sequence/structure relationships and relative binding free energies is better indicated for systems where existing available information reduces the conformational search space. The trade-off between efficient sampling and the physics model to use currently limits MELD-DNA to implicit solvent, which is known to be less accurate than simulations in explicit solvent. Nonetheless, the method samples accurate atomic structures representative of the native state for most of the systems studied, even when starting from B-DNA conformations. The physics-based nature of the ensembles readily makes this a valuable approach to obtaining structures for more detailed studies (e.g. in explicit solvent), such as using different MELD clusters as seeds for adaptive sampling simulations combined with Markov state models (75,76). Thus, we believe the current framework can be a powerful tool to increase our understanding of nucleic acid complexes.
We have shown a successful strategy for identifying protein-DNA complexes sensitive to the DNA sequence, exploring multiple binding modes and which samples DNA deformation during binding. The MELD approach draws on the successes of the HADDOCK docking strategy of combining ambiguous and noisy data with the search engine. It goes beyond harmonic deformations by using a molecular dynamics engine sensitive to the DNA sequence and identifying successful structures based on a statistical mechanics treatment of the generated ensemble rather than on scoring functions. It is worth noting that we employed HADDOCK as regular users versed in structural biology, and expert users might improve the performance. Regardless, while the accuracy in HADDOCK structure predictions is lower than that of MELD, it also comes at a small fraction of the computational cost: MELD simulations require 30 GPUs typically running for about 2 days for these systems, while HADDOCK calculations take a few minutes, on a single core. HADDOCK predictions can readily be incorporated as structural seeds for each replica in MELD, as well as a source of ambiguous/noisy information. In this regime, MELD can be used to refine HADDOCK's models while simultaneously identifying the most likely model as the one most prevalent in the lowest temperature ensemble (77).
The choice of restraints, noise and ambiguity in the dataset will depend on how much data are available to the user. Higher accuracy and amount of data lead to faster convergence but reduced exploration of the binding land-scape. For most purposes of structure prediction, protocol 1 is the most transferable and generalizable. Furthermore, users might decide to change the restraints imposed on the protein and DNA. In our current approach, the DNA has more flexibility than the protein system (see Supplementary Figure S17) to account for DNA bending during binding. However, restraints between base pairs will keep structures from sampling conformations where one base is open (e.g. for binding methyltransferases). For such cases, a user would modify the restraints affecting the desired region of DNA. Similarly, the protein restraints affect the ability to sample open/closed states, which are needed for some systems where the binding site is not accessible in the closed state. Knowing the binding site, exceptions can be made on which regions of the protein to restrain.
We see three issues related to (i) accessibility of the binding site, (ii) force field accuracy and (ii) efficiency of replica exchanges. Proteins 1BGB and 2B0D require an opening event before the DNA can access the binding site; however, the standard protocol used to predict binding (protocol 1) uses distance restraints that prevent the protein from unfolding and, in this case, from accessing the open conformation state. Furthermore, force fields typically have a bias towards compact structures, especially in implicit solvent, further favoring compact closed conformations that prevent DNA binding. Some authors have also suggested a possible DNA + protein force field imbalance resulting in stronger than expected arginine and lysine interactions with the phosphates (78)(79)(80)(81). Such highly charged systems challenge the accuracy of the force field in the implicit model. This issue is best seen in our set of complexes for the TATA box-binding protein (1CDW). The bend in DNA structure induced by binding of the TATA box-binding protein is further accentuated in the implicit solvent (see Supplementary Figure S21), resulting in overly strong electrostatic interactions (see "Specific Results" in the Supplementary Information). Because of compaction and these strong electrostatic interactions, replica exchanges that favor unbinding have a lower probability than similar approaches for proteinprotein and protein-peptide binding (see Supplementary SI; Supplementary Figures S22 and S23). Thus, despite the current success, future endeavors will aim to introduce explicit solvent into the methodology.
Overall, the MELD-DNA methodology we presented herein fills a gap in computational tools that predict protein-DNA binding. We have shown that the method is sensitive to sequence and structural preferences and is thus a promising new approach to studying this type of system. The MELD code is freely available through GitHub as a plugin to openmm (82). On a diverse set of protein-DNA systems involving 15 different complexes, the method successfully predicted 10 of them as high population clusters. We believe the physics-based insights which MELD-DNA can provide will advance our understanding of protein-DNA interactions and our ability to simulate events related to supramolecular chemistry. Increasing our structural knowledge and sequence binding structural preferences combined with other factors that affect in vivo binding (e.g. chromatin state and accessibility) can bring new understanding to the molecular mechanisms that orchestrate gene regulation.